Problem Statement

Bankruptcy prediction is of increasing interest to firms that stand to lose money on unpaid debts. Since computers can store huge datasets pertaining to bankruptcy, making accurate predictions from them beforehand is becoming increasingly important.

The data were collected from the Taiwan Economic Journal for the years 1999 to 2009. Company bankruptcy was defined based on the business regulations of the Taiwan Stock Exchange.

In this project you will use various classification algorithms on the bankruptcy dataset to predict bankruptcies with satisfactory accuracy long before the actual event.

Attribute Information

Column names and descriptions have been updated to make the data easier to understand (Y = output feature, X = input features)

Y - Bankrupt?: Class label (1 = Yes, 0 = No)

X1 - ROA(C) before interest and depreciation before interest: Return On Total Assets(C)

X2 - ROA(A) before interest and % after tax: Return On Total Assets(A)

X3 - ROA(B) before interest and depreciation after tax: Return On Total Assets(B)

X4 - Operating Gross Margin: Gross Profit/Net Sales

X5 - Realized Sales Gross Margin: Realized Gross Profit/Net Sales

X6 - Operating Profit Rate: Operating Income/Net Sales

X7 - Pre-tax net Interest Rate: Pre-Tax Income/Net Sales

X8 - After-tax net Interest Rate: Net Income/Net Sales

X9 - Non-industry income and expenditure/revenue: Net Non-operating Income Ratio

X10 - Continuous interest rate (after tax): Net Income-Exclude Disposal Gain or Loss/Net Sales

X11 - Operating Expense Rate: Operating Expenses/Net Sales

X12 - Research and development expense rate: (Research and Development Expenses)/Net Sales

X13 - Cash flow rate: Cash Flow from Operating/Current Liabilities

X14 - Interest-bearing debt interest rate: Interest-bearing Debt/Equity

X15 - Tax rate (A): Effective Tax Rate

X16 - Net Value Per Share (B): Book Value Per Share(B)

X17 - Net Value Per Share (A): Book Value Per Share(A)

X18 - Net Value Per Share (C): Book Value Per Share(C)

X19 - Persistent EPS in the Last Four Seasons: EPS-Net Income

X20 - Cash Flow Per Share

X21 - Revenue Per Share (Yuan ¥): Sales Per Share

X22 - Operating Profit Per Share (Yuan ¥): Operating Income Per Share

X23 - Per Share Net profit before tax (Yuan ¥): Pretax Income Per Share

X24 - Realized Sales Gross Profit Growth Rate

X25 - Operating Profit Growth Rate: Operating Income Growth

X26 - After-tax Net Profit Growth Rate: Net Income Growth

X27 - Regular Net Profit Growth Rate: Continuing Operating Income after Tax Growth

X28 - Continuous Net Profit Growth Rate: Net Income-Excluding Disposal Gain or Loss Growth

X29 - Total Asset Growth Rate: Total Asset Growth

X30 - Net Value Growth Rate: Total Equity Growth

X31 - Total Asset Return Growth Rate Ratio: Return on Total Asset Growth

X32 - Cash Reinvestment %: Cash Reinvestment Ratio

X33 - Current Ratio

X34 - Quick Ratio: Acid Test

X35 - Interest Expense Ratio: Interest Expenses/Total Revenue

X36 - Total debt/Total net worth: Total Liability/Equity Ratio

X37 - Debt ratio %: Liability/Total Assets

X38 - Net worth/Assets: Equity/Total Assets

X39 - Long-term fund suitability ratio (A): (Long-term Liability+Equity)/Fixed Assets

X40 - Borrowing dependency: Cost of Interest-bearing Debt

X41 - Contingent liabilities/Net worth: Contingent Liability/Equity

X42 - Operating profit/Paid-in capital: Operating Income/Capital

X43 - Net profit before tax/Paid-in capital: Pretax Income/Capital

X44 - Inventory and accounts receivable/Net value: (Inventory+Accounts Receivables)/Equity

X45 - Total Asset Turnover

X46 - Accounts Receivable Turnover

X47 - Average Collection Days: Days Receivable Outstanding

X48 - Inventory Turnover Rate (times)

X49 - Fixed Assets Turnover Frequency

X50 - Net Worth Turnover Rate (times): Equity Turnover

X51 - Revenue per person: Sales Per Employee

X52 - Operating profit per person: Operation Income Per Employee

X53 - Allocation rate per person: Fixed Assets Per Employee

X54 - Working Capital to Total Assets

X55 - Quick Assets/Total Assets

X56 - Current Assets/Total Assets

X57 - Cash/Total Assets

X58 - Quick Assets/Current Liability

X59 - Cash/Current Liability

X60 - Current Liability to Assets

X61 - Operating Funds to Liability

X62 - Inventory/Working Capital

X63 - Inventory/Current Liability

X64 - Current Liabilities/Liability

X65 - Working Capital/Equity

X66 - Current Liabilities/Equity

X67 - Long-term Liability to Current Assets

X68 - Retained Earnings to Total Assets

X69 - Total income/Total expense

X70 - Total expense/Assets

X71 - Current Asset Turnover Rate: Current Assets to Sales

X72 - Quick Asset Turnover Rate: Quick Assets to Sales

X73 - Working capitcal Turnover Rate: Working Capital to Sales

X74 - Cash Turnover Rate: Cash to Sales

X75 - Cash Flow to Sales

X76 - Fixed Assets to Assets

X77 - Current Liability to Liability

X78 - Current Liability to Equity

X79 - Equity to Long-term Liability

X80 - Cash Flow to Total Assets

X81 - Cash Flow to Liability

X82 - CFO to Assets

X83 - Cash Flow to Equity

X84 - Current Liability to Current Assets

X85 - Liability-Assets Flag: 1 if Total Liability exceeds Total Assets, 0 otherwise

X86 - Net Income to Total Assets

X87 - Total assets to GNP price

X88 - No-credit Interval

X89 - Gross Profit to Sales

X90 - Net Income to Stockholder's Equity

X91 - Liability to Equity

X92 - Degree of Financial Leverage (DFL)

X93 - Interest Coverage Ratio (Interest expense to EBIT)

X94 - Net Income Flag: 1 if Net Income is Negative for the last two years, 0 otherwise

X95 - Equity to Liability

Importing the libraries

Data Preparation

Importing Dataset and examining the properties of the dataset

The dataset has 96 columns (93 of them float and 3 int) and 6819 rows.

Checking for categorical Variables

Net Income Flag has only one unique value, which is 1

Checking for Missing values

We can see that there are no missing or NaN values in the dataset
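These inspection steps can be sketched as follows; a tiny synthetic frame stands in for the real CSV, whose loading call and path are not shown here:

```python
import pandas as pd

# Tiny synthetic stand-in for the real dataset (actual data: 6819 rows x 96
# columns, loaded with pd.read_csv).
df = pd.DataFrame({
    "Bankrupt?": [0, 1, 0],            # int class label
    "Current Ratio": [0.1, 0.2, 0.3],  # float input feature
    "Net Income Flag": [1, 1, 1],      # constant column
})

print(df.shape)                         # (rows, columns)
print(df.dtypes.value_counts())         # float vs int column counts
print(df["Net Income Flag"].nunique())  # 1 -> constant, effectively categorical
print(df.isna().sum().sum())            # 0 -> no missing values
```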

Identifying numerical columns

  1. Net Income Flag is the only categorical feature
  2. All the other features are numerical

Some features have a very high magnitude compared to the other features at the .99 percentile

Feature Elimination: correlation-based (Pearson correlation)

With this many features it is hard to compare how the features are correlated to each other

Pearson correlation

Checkpoint 1

  1. Data has been checked for missing values
  2. Identified that outliers are present in the dataset from the descriptive statistics chart and percentile distribution
  3. For each pair of features with a correlation above .85, one feature is eliminated (Pearson correlation)
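A minimal sketch of this correlation-based elimination, assuming pandas; the helper name `drop_correlated` and the toy frame are illustrative:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """Drop one feature of every pair whose absolute Pearson correlation
    exceeds the threshold."""
    corr = df.corr(method="pearson").abs()
    # Look only at the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy demo: "b" is an exact multiple of "a", so one of the pair is removed.
demo = pd.DataFrame({"a": [1.0, 2, 3, 4],
                     "b": [2.0, 4, 6, 8],
                     "c": [4.0, 1, 3, 2]})
print(drop_correlated(demo).columns.tolist())  # ['a', 'c']
```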

EDA + Outlier Analysis and Noise Removal and Visualization

Checking skewness of all the features in the form of distribution

  1. Checking all the features one by one is impractical, as there are a lot of them
  2. So let's handle the features in groups based on their range of skewness

Above we can see the top 5 positively skewed and the top 5 negatively skewed features

  1. Not all the features have skewness near zero
  2. We have features that are highly skewed in both the positive and the negative direction
  3. The range of skewness is from -71 to +82
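A quick way to compute and rank skewness for every feature at once, shown on a toy frame:

```python
import pandas as pd

# Toy frame: one long right tail, one symmetric column, one long left tail.
df = pd.DataFrame({
    "right_skewed": [0, 0, 0, 0, 100.0],
    "symmetric":    [1, 2, 3, 4, 5.0],
    "left_skewed":  [100, 100, 100, 100, 0.0],
})

# Sort so the most positively skewed features come first and the most
# negatively skewed come last, mirroring the grouping used in this section.
skew = df.skew().sort_values(ascending=False)
print(skew)
```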

Let's start analyzing the features, going from right-skewed to left-skewed

NOTE: This is not a re-sampling technique; re-sampling has to be performed after splitting the dataset

The intention of this skewness analysis is to remove the noise at the end or beginning of the distribution, i.e. values above the .99 percentile and below the .01 percentile that are very peculiar or uneven

The purpose of this is illustrated below while removing noise from features with skewness in the range 70 to 85

Analysing Feature from skewness 70 to 85

Features Between the skewness range 70 to 85 :

  1. Realized Sales Gross Profit Growth Rate
  2. Contingent liabilities/Net worth
  3. Net Value Growth Rate
  4. Total income/Total expense
  5. Current Ratio
  6. Fixed Assets to Assets

These features contain very uneven values at the end or the beginning of their distributions. These uneven values are noise that affects model performance and is stubborn to any kind of transformation

All these features start their distributions with 0 as the 0th percentile

  1. Realized Sales Gross Profit Growth Rate => 99 percent of values are below .02, but the 100th percentile is 1, which is noise
  2. Similarly, in Contingent liabilities/Net worth => the 99th percentile is 0.01 while the 100th percentile is 1 (noise)
  3. Net Value Growth Rate => the 99th percentile is 0.001 and the 100th percentile is 9330000000 (noise)
  4. Total income/Total expense => the 99th percentile is 0.003721 and the 100th percentile is 1 (noise)
  5. Current Ratio => the 99th percentile is 0.074595 and the 100th percentile is 2750000000 (noise)
  6. Fixed Assets to Assets => the 99th percentile is 0.79 and the 100th percentile is 8320000000 (noise)
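The percentile trimming described above can be sketched as follows; the helper name `trim_percentiles` and the toy series are mine:

```python
import pandas as pd

def trim_percentiles(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Keep only values between the lower and upper quantiles,
    discarding the peculiar extremes at either end."""
    lo, hi = s.quantile(lower), s.quantile(upper)
    return s[(s >= lo) & (s <= hi)]

# Toy series: a smooth bulk of values plus one extreme value far beyond
# the .99 percentile, mimicking the noisy 100th percentiles listed above.
s = pd.Series(list(range(100)) + [9_330_000_000])
trimmed = trim_percentiles(s)
print(s.max(), trimmed.max())  # the extreme value is gone after trimming
```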

Analysing Features having skewness from 60 to 70

  1. Total Asset Return Growth Rate Ratio
  2. Continuous Net Profit Growth Rate

Analysing Features having skewness from 45 to 61

  1. Inventory/Working Capital
  2. Degree of Financial Leverage (DFL)
  3. Total debt/Total net worth
  4. Quick Assets/Current Liability
  5. Revenue per person

Analysing Features having skewness from 30 to 44

  1. Average Collection Days
  2. Quick Ratio
  3. Equity to Long-term Liability
  4. Non-industry income and expenditure/revenue
  5. Revenue Per Share (Yuan ¥)

Analysing Features having skewness from 20 to 30

  1. Borrowing dependency
  2. Total assets to GNP price
  3. Long-term fund suitability ratio (A)
  4. Accounts Receivable Turnover
  5. Allocation rate per person

Analysing Features having skewness from 13 to 20

  1. Inventory and accounts receivable/Net value
  2. Current Liability to Current Assets
  3. Cash/Current Liability
  4. Cash Flow to Equity

Analysing Features having skewness from 7 to 10

  1. Displaying Variance
  2. Interest-bearing debt interest rate
  3. Equity to Liability
  4. Operating profit per person
  5. Cash Flow Per Share
  6. Net Worth Turnover Rate (times)
  7. Total expense/Assets

Analysing Features having skewness from 4 to 7

  1. Net Value Per Share (B)
  2. Persistent EPS in the Last Four Seasons

Features with skewness from -1 to 4 are ignored here, as they are less skewed; they might be handled later, after splitting the dataset

Analysing Features having skewness from -10 to -2

  1. Operating Gross Margin

Analysing Features having skewness from -40 to -20

  1. Net Income to Stockholder's Equity
  2. Working Capital/Equity
  3. Working capitcal Turnover Rate
  4. After-tax Net Profit Growth Rate

Analysing Features having skewness from -85 to -50

  1. Operating Profit Growth Rate
  2. Operating Profit Rate

Checkpoint 2

Splitting data

NOTE!!! The variable [bank_data_stage_2] is the training set (X_train and y_train combined)

1. Training Set:

    a. Target 1: 176
    b. Target 0: 4996

2. Testing Set:

    a. Target 1: 44
    b. Target 0: 1250

Now that the dataset is split, we can perform any sampling method on the training dataset
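A sketch of a stratified split producing class counts like those above; synthetic imbalanced data stands in for the real frame, and the exact arguments are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the bankruptcy frame
# (~4% minority class, roughly matching the real 220/6819 ratio).
X, y = make_classification(n_samples=1000, weights=[0.96], random_state=0)

# stratify=y keeps the class ratio identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(pd.Series(y_train).value_counts())
print(pd.Series(y_test).value_counts())
```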

Feature Selection: select the best features and rank them in order using

(SelectKBest technique)

We can sort the ranked features from high score to low score and check the model performance for each addition of a feature, starting from the highest

NOTE!!! This can also be used for tuning the model and checking the model performance

For Example:

Feature 1   Feature 2   Feature 3   Feature 4   Target
1.2         3.5         0.843       9.6         1
1.6         1.5         0.43        9.5         0
5.9         95          0.5         9.8         0
3.5         9.6         0.3         0.23        1
1. By using the SelectKBest method we calculate a score for each feature reflecting how much it helps the model in prediction

Features    Score
Feature 1   5
Feature 2   25
Feature 3   10
Feature 4   15

2. We make a list of features with the calculated scores
3. We sort the list based on ranking or scores

Features    Score
Feature 2   25
Feature 4   15
Feature 3   10
Feature 1   5

4. We add features one by one from high score to low score and calculate the model performance at each step

[Feature 2]----------------------------------------------check model performance=====> score 1=.56
[Feature 2]+[Feature 4]----------------------------------check model performance =====> score 2=.65
[Feature 2]+[Feature 4]+[Feature 3]---------------------check model performance =====> score 3=.75
[Feature 2]+[Feature 4]+[Feature 3]+[Feature 1]----------check model performance =====> score 4=.32

RESULTS!! Score 3 is the highest, so we would select the sequence Feature 2, Feature 4 and Feature 3 for the best performance

5. We can use this as a parameter to find the optimal feature sequence and the cut-off that yields the highest score

6. The best score is the maximum score obtained across these feature subsets
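The steps above can be sketched end to end on synthetic data; SelectKBest with the ANOVA F-test supplies the ranking, and a RandomForestClassifier stands in for the actual model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Steps 1-3: score every feature with the ANOVA F-test and rank high to low.
selector = SelectKBest(f_classif, k="all").fit(X_tr, y_tr)
ranked = selector.scores_.argsort()[::-1]

# Step 4: add features one at a time in ranked order and track performance.
best_k, best_f1 = 0, 0.0
for k in range(1, len(ranked) + 1):
    cols = ranked[:k]
    model = RandomForestClassifier(random_state=0).fit(X_tr[:, cols], y_tr)
    f1 = f1_score(y_te, model.predict(X_te[:, cols]))
    if f1 > best_f1:
        best_k, best_f1 = k, f1

# Steps 5-6: the best subset is the prefix of the ranking with the top score.
print(best_k, round(best_f1, 3))
```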

columns score p_value
0 Retained Earnings to Total Assets 403.773118 0.0000
1 Debt ratio % 382.439927 0.0000
2 Net worth/Assets 382.439927 0.0000
3 ROA(C) before interest and depreciation befor... 372.487581 0.0000
4 Persistent EPS in the Last Four Seasons 310.063059 0.0000
5 Cash/Current Liability 286.252677 0.0000
6 Current Liability to Current Assets 284.595750 0.0000
7 Working Capital to Total Assets 247.313865 0.0000
8 Current Liability to Assets 225.122244 0.0000
9 Borrowing dependency 205.120464 0.0000
10 Net Income to Stockholder's Equity 191.725273 0.0000
11 Net Value Per Share (B) 172.137867 0.0000
12 Liability-Assets Flag 146.025382 0.0000
13 Equity to Long-term Liability 145.974589 0.0000
14 Working Capital/Equity 128.954106 0.0000
15 Non-industry income and expenditure/revenue 110.983253 0.0000
16 Total expense/Assets 90.930228 0.0000
17 Total income/Total expense 85.334556 0.0000
18 Operating Gross Margin 74.137990 0.0000
19 Operating profit per person 64.427347 0.0000
20 Tax rate (A) 62.685816 0.0000
21 Cash/Total Assets 57.975673 0.0000
22 Total assets to GNP price 57.198236 0.0000
23 Quick Assets/Total Assets 52.781185 0.0000
24 Current Ratio 51.061844 0.0000
25 Quick Assets/Current Liability 45.741873 0.0000
26 CFO to Assets 43.589989 0.0000
27 Equity to Liability 42.928794 0.0000
28 Fixed Assets Turnover Frequency 39.380221 0.0000
29 Inventory and accounts receivable/Net value 38.909708 0.0000
30 Operating Profit Rate 38.850426 0.0000
31 Cash flow rate 36.832999 0.0000
32 Total Asset Turnover 34.540194 0.0000
33 Contingent liabilities/Net worth 32.300455 0.0000
34 Average Collection Days 28.537532 0.0000
35 Fixed Assets to Assets 28.537532 0.0000
36 Total debt/Total net worth 28.537532 0.0000
37 Allocation rate per person 28.537532 0.0000
38 Accounts Receivable Turnover 28.537532 0.0000
39 Revenue per person 28.537532 0.0000
40 Net Value Growth Rate 28.537532 0.0000
41 Quick Ratio 28.537532 0.0000
42 Total Asset Return Growth Rate Ratio 28.404717 0.0000
43 Cash Flow Per Share 26.275563 0.0000
44 Working capitcal Turnover Rate 25.605834 0.0000
45 Operating Profit Growth Rate 21.745893 0.0000
46 Cash Flow to Total Assets 20.771819 0.0000
47 Current Assets/Total Assets 19.982075 0.0000
48 Revenue Per Share (Yuan ¥) 18.624912 0.0000
49 Cash Flow to Equity 16.779115 0.0000
50 Total Asset Growth Rate 14.101149 0.0002
51 Cash Reinvestment % 11.383942 0.0007
52 Cash Flow to Liability 10.767865 0.0010
53 After-tax Net Profit Growth Rate 9.833110 0.0017
54 Long-term fund suitability ratio (A) 8.151000 0.0043
55 Research and development expense rate 6.270324 0.0123
56 Cash Turnover Rate 4.142295 0.0419
57 Quick Asset Turnover Rate 3.168890 0.0751
58 Continuous Net Profit Growth Rate 3.054471 0.0806
59 Current Liabilities/Liability 2.525012 0.1121
60 Degree of Financial Leverage (DFL) 1.969514 0.1606
61 Current Asset Turnover Rate 1.738746 0.1874
62 Net Worth Turnover Rate (times) 1.572778 0.2099
63 Realized Sales Gross Profit Growth Rate 1.315570 0.2514
64 Interest Coverage Ratio (Interest expense to ... 0.341999 0.5587
65 Long-term Liability to Current Assets 0.259598 0.6104
66 Inventory/Current Liability 0.187725 0.6648
67 No-credit Interval 0.147786 0.7007
68 Interest-bearing debt interest rate 0.066144 0.7970
69 Operating Expense Rate 0.035123 0.8513
70 Inventory/Working Capital 0.014130 0.9054
71 Interest Expense Ratio 0.010103 0.9199
72 Inventory Turnover Rate (times) 0.005834 0.9391

1. Treating the standard IQR scale value, i.e. 1.5, as a hyperparameter while handling outliers, by varying it from 1.5, 1.6, 1.7 and so on....

  1. Normally we use the standard value 1.5 as the scale
  2. If we remove every outlier according to the 1.5 scale, we end up removing 90% of the data and nothing is left
  3. We always have to bear with some outliers, but sometimes we can eliminate a portion of them, feature by feature

Problem: But!! we don't know how many outliers we can eliminate on a per-feature basis

2. Solution:

And!! Now we have another hyperparameter: the number of features for which outliers are handled, going from the feature with the fewest outlier occurrences to the feature with the most

So we have two hyperparameters to tune

  1. IQR Scale value
  2. High-rank-to-low-rank feature addition technique

Only data points of the majority class are fed in, because there is no compromise on modifying data of the minority class

Suppose you want to select the first 15 features for eliminating outliers, with the IQR scale ranging from 1.5 to 3.0

We also have the list of columns ranked by feature score, obtained from the SelectKBest technique
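A sketch of IQR-based outlier removal applied only to the majority class; the helper name `iqr_filter` and the toy frame are mine, and `scale` is the hyperparameter being varied:

```python
import pandas as pd

def iqr_filter(df: pd.DataFrame, cols, scale: float = 1.5) -> pd.DataFrame:
    """Drop rows falling outside [Q1 - scale*IQR, Q3 + scale*IQR]
    for each of the given columns."""
    mask = pd.Series(True, index=df.index)
    for c in cols:
        q1, q3 = df[c].quantile(0.25), df[c].quantile(0.75)
        iqr = q3 - q1
        mask &= df[c].between(q1 - scale * iqr, q3 + scale * iqr)
    return df[mask]

# Toy frame: nine majority-class rows (one with an extreme outlier)
# plus one minority-class row.
df = pd.DataFrame({"Bankrupt?": [0] * 9 + [1],
                   "x": [1, 2, 3, 4, 5, 6, 7, 8, 1000, 2]})

# Filter the majority class only; minority rows are kept untouched.
majority = iqr_filter(df[df["Bankrupt?"] == 0], ["x"], scale=1.5)
cleaned = pd.concat([majority, df[df["Bankrupt?"] == 1]])
print(len(df), len(cleaned))  # the extreme majority row is dropped
```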

Models

  1. XGboost
  2. Random Forest
  3. Gradient Boost

Tuning data by IQR scale for XGBoost model

  1. F1 score - 0.529
  2. Precision - 0.42
  3. Recall - 0.70
  1. Hence we treat the data with the IQR scale value 3.8 and store it in the variable data_xgb
  2. data_xgb: this is the data with outliers handled, suitable for the XGBoost model

1. Tuning model by Feature Selection for XGBoost Model

(adding highly ranked features one by one from the top and checking if there is a rise in model performance)
  1. F1 score - 0.529
  2. Precision - 0.42
  3. Recall - 0.70
  1. F1 score - 0.545
  2. Precision - 0.42
  3. Recall - 0.75

From this we can see that there is a gradual rise in the F1 and recall scores while the precision stays constant

Hypertuning XGBoost model with model parameters to check the improvement in performance

Tuning data for the Random Forest model by IQR scale

  1. F1 score - 0.46
  2. Precision - 0.32
  3. Recall - 0.79
  1. Hence we treat the data with the IQR scale value 2.3 and store it in the variable data_rf
  2. data_rf: this is the data with outliers handled, suitable for the Random Forest model

Tuning model by Feature Selection for Random Forest Model

(adding highly ranked features one by one from the top and checking if there is a rise in model performance)
  1. F1 score - 0.44 approx
  2. Precision - 0.33 approx
  3. Recall - 0.75 approx
  1. F1 score - 0.45 approx
  2. Precision - 0.34 approx
  3. Recall - 0.77 approx

There is not much increase in model performance from the feature selection method
CAUTION: The performance of XGBoost fluctuates a lot for the same parameter values

Hypertuning Random Forest with model parameters to check the improvement in performance
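Such parameter tuning can be sketched with scikit-learn's GridSearchCV; the parameter grid here is illustrative, not the one actually used in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the prepared training set.
X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)

# Illustrative grid; the real search would cover more parameters.
param_grid = {"n_estimators": [100, 200], "max_depth": [4, None]}

# F1 is the scoring metric, consistent with the evaluation used above.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```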

Tuning data for GradientBoost model for IQR scale from 2.3 to 3.9

  1. F1 score - 0.40 (approx)
  2. Precision - 0.26 (approx)
  3. Recall - 0.84 (approx)

Tuning model by Feature Selection for Gradient Boost Model

The Gradient Boost model has the same performance after tuning with the high-ranked feature selection method

Hypertuning GradientBoost with model parameters

Finally, gathering the best tuned parameters and comparing the models' performance

Parameters considered for tuning model:

  1. IQR Scale
  2. Feature Selection Method (Adding features from highest rank to lowest rank : Rank is obtained by Select K Best Technique)
  3. Model Parameter tuning

Models Used:

1. XGBoost
2. Gradient Boost
3. Random Forest

We will consider two models

  1. XGBoost
  2. Random Forest

    Why not Gradient Boost? Because Gradient Boost has a very good recall value, but its precision-recall trade-off is unfavorable

1. XGBoost (Best)

Best Parameters

  1. IQR Scale => 3.8, i.e. keeping values between (Quantile_1 - 3.8 x IQR) and (Quantile_3 + 3.8 x IQR)
  2. Rank-wise one-by-one feature addition method => 52: adding the first 52 ranked features

2. Random Forest (Best)

Best Parameters

  1. IQR Scale => 2.5, i.e. keeping values between (Quantile_1 - 2.5 x IQR) and (Quantile_3 + 2.5 x IQR)
  2. Rank-wise one-by-one feature addition method => 69: adding the first 69 ranked features

Random Forest Model Performance

ROC Curve of models

1. XGBoost 
2. Random Forest

  1. Deciding model performance based on the ROC curve and AUC is bad practice in the case of an imbalanced dataset; AUC and the ROC curve are not enough
  2. Because the ROC curve and AUC are insensitive to class imbalance, they will show a good score even for bad models
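The point can be demonstrated on synthetic scores: with only 2% positives, a weak scorer still earns a decent-looking ROC AUC while its average precision (the area under the PR curve) stays low. The prevalence and noise model here are my own illustrative choices:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Highly imbalanced labels (~2% positives) and a weakly informative score:
# positives get only a small +0.3 shift over uniform noise.
y_true = rng.random(5000) < 0.02
y_score = y_true * 0.3 + rng.random(5000)

roc = roc_auc_score(y_true, y_score)
ap = average_precision_score(y_true, y_score)
print(round(roc, 3), round(ap, 3))
# ROC AUC looks respectable while average precision stays low,
# which is why the PR curve is the more honest view here.
```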

Precision Recall Curve (Precision Recall Trade-Off)

1. XGBoost 
2. Random Forest
  1. The critical situation is a company going bankrupt, which is a risk for stakeholders and other investors, so we must try as hard as possible to keep our model from misclassifying the minority class, i.e. a company being bankrupt

    • Why? Because:

    Situation 1: If a majority-class company, i.e. one not going bankrupt, is misclassified as bankrupt, investors or stakeholders might avoid that misclassified company even though in reality it did not go bankrupt. This is also a loss, but not as great as Situation 2

    Situation 2: This situation is a nightmare for people who put a lot of money into a company (stakeholders or investors). If a minority-class company, i.e. one that goes bankrupt, is misclassified as not bankrupt, then anyone who invested in it loses a great deal of money and time when the company actually goes bankrupt, as the prediction was completely wrong. This is obviously a much worse situation than Situation 1

    Recall expresses the ability to find all relevant instances in a dataset; precision expresses the proportion of the data points our model says are relevant that actually are relevant.

    • If we want to avoid Situation 1 : Then we need to concentrate on High precision models
    • If we want to avoid Situation 2 : Then we need to concentrate on High recall models

Models: XGBoost and Random Forest give the best recall, about 75%, correctly classifying 75% of the companies that went bankrupt